Goto

Collaborating Authors

 temporal relation




The Temporal Game: A New Perspective on Temporal Relation Extraction

arXiv.org Artificial Intelligence

In this paper we demo the Temporal Game, a novel approach to temporal relation extraction that casts the task as an interactive game. Instead of directly annotating interval-level relations, our approach decomposes them into point-wise comparisons between the start and end points of temporal entities. At each step, players classify a single point relation, and the system applies temporal closure to infer additional relations and enforce consistency. This point-based strategy naturally supports both interval and instant entities, enabling more fine-grained and flexible annotation than any previous approach. The Temporal Game also lays the groundwork for training reinforcement learning agents, by treating temporal annotation as a sequential decision-making task. To showcase this potential, the demo presented in this paper includes a Game mode, in which users annotate texts from the TempEval-3 dataset and receive feedback based on a scoring system, and an Annotation mode, that allows custom documents to be annotated and resulting timeline to be exported. Therefore, this demo serves both as a research tool and an annotation interface. The demo is publicly available at https://temporal-game.inesctec.pt, and the source code is open-sourced to foster further research and community-driven development in temporal reasoning and annotation.


Deep Temporal Reasoning in Video Language Models: A Cross-Linguistic Evaluation of Action Duration and Completion through Perfect Times

arXiv.org Artificial Intelligence

Human perception of events is intrinsically tied to distinguishing between completed (perfect and telic) and ongoing (durative) actions, a process mediated by both linguistic structure and visual cues. In this work, we introduce the \textbf{Perfect Times} dataset, a novel, quadrilingual (English, Italian, Russian, and Japanese) multiple-choice question-answering benchmark designed to assess video-language models (VLMs) on temporal reasoning. By pairing everyday activity videos with event completion labels and perfectivity-tailored distractors, our dataset probes whether models truly comprehend temporal dynamics or merely latch onto superficial markers. Experimental results indicate that state-of-the-art models, despite their success on text-based tasks, struggle to mirror human-like temporal and causal reasoning grounded in video. This study underscores the necessity of integrating deep multimodal cues to capture the nuances of action duration and completion within temporal and causal video dynamics, setting a new standard for evaluating and advancing temporal reasoning in VLMs.


Interpretable Traces, Unexpected Outcomes: Investigating the Disconnect in Trace-Based Knowledge Distillation

arXiv.org Artificial Intelligence

Question Answering (QA) poses a challenging and critical problem, particularly in today's age of interactive dialogue systems such as ChatGPT, Perplexity, Microsoft Copilot, etc. where users demand both accuracy and transparency in the model's outputs. Since smaller language models (SLMs) are computationally more efficient but often under-perform compared to larger models, Knowledge Distillation (KD) methods allow for finetuning these smaller models to improve their final performance. Lately, the intermediate tokens or the so called `reasoning' traces produced by Chain-of-Thought (CoT) or by reasoning models such as DeepSeek R1 are used as a training signal for KD. However, these reasoning traces are often verbose and difficult to interpret or evaluate. In this work, we aim to address the challenge of evaluating the faithfulness of these reasoning traces and their correlation with the final performance. To this end, we employ a KD method leveraging rule-based problem decomposition. This approach allows us to break down complex queries into structured sub-problems, generating interpretable traces whose correctness can be readily evaluated, even at inference time. Specifically, we demonstrate this approach on Open Book QA, decomposing the problem into a Classification step and an Information Retrieval step, thereby simplifying trace evaluation. Our SFT experiments with correct and incorrect traces on the CoTemp QA, Microsoft Machine Reading Comprehension QA, and Facebook bAbI QA datasets reveal the striking finding that correct traces do not necessarily imply that the model outputs the correct final solution. Similarly, we find a low correlation between correct final solutions and intermediate trace correctness. These results challenge the implicit assumption behind utilizing reasoning traces for improving SLMs' final performance via KD.


Salient Temporal Encoding for Dynamic Scene Graph Generation

arXiv.org Artificial Intelligence

Representing a dynamic scene using a structured spatial-temporal scene graph is a novel and particularly challenging task. To tackle this task, it is crucial to learn the temporal interactions between objects in addition to their spatial relations. Due to the lack of explicitly annotated temporal relations in current benchmark datasets, most of the existing spatial-temporal scene graph generation methods build dense and abstract temporal connections among all objects across frames. However, not all temporal connections are encoding meaningful temporal dynamics. We propose a novel spatial-temporal scene graph generation method that selectively builds temporal connections only between temporal-relevant objects pairs and represents the temporal relations as explicit edges in the scene graph. The resulting sparse and explicit temporal representation allows us to improve upon strong scene graph generation baselines by up to $4.4\%$ in Scene Graph Detection. In addition, we show that our approach can be leveraged to improve downstream vision tasks. Particularly, applying our approach to action recognition, shows 0.6\% gain in mAP in comparison to the state-of-the-art


ExpliCa: Evaluating Explicit Causal Reasoning in Large Language Models

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are increasingly used in tasks requiring interpretive and inferential accuracy. In this paper, we introduce ExpliCa, a new dataset for evaluating LLMs in explicit causal reasoning. ExpliCa uniquely integrates both causal and temporal relations presented in different linguistic orders and explicitly expressed by linguistic connectives. The dataset is enriched with crowdsourced human acceptability ratings. We tested LLMs on ExpliCa through prompting and perplexity-based metrics. We assessed seven commercial and open-source LLMs, revealing that even top models struggle to reach 0.80 accuracy. Interestingly, models tend to confound temporal relations with causal ones, and their performance is also strongly influenced by the linguistic order of the events. Finally, perplexity-based scores and prompting performance are differently affected by model size.


CognTKE: A Cognitive Temporal Knowledge Extrapolation Framework

arXiv.org Artificial Intelligence

Reasoning future unknowable facts on temporal knowledge graphs (TKGs) is a challenging task, holding significant academic and practical values for various fields. Existing studies exploring explainable reasoning concentrate on modeling comprehensible temporal paths relevant to the query. Yet, these path-based methods primarily focus on local temporal paths appearing in recent times, failing to capture the complex temporal paths in TKG and resulting in the loss of longer historical relations related to the query. Motivated by the Dual Process Theory in cognitive science, we propose a \textbf{Cogn}itive \textbf{T}emporal \textbf{K}nowledge \textbf{E}xtrapolation framework (CognTKE), which introduces a novel temporal cognitive relation directed graph (TCR-Digraph) and performs interpretable global shallow reasoning and local deep reasoning over the TCR-Digraph. Specifically, the proposed TCR-Digraph is constituted by retrieving significant local and global historical temporal relation paths associated with the query. In addition, CognTKE presents the global shallow reasoner and the local deep reasoner to perform global one-hop temporal relation reasoning (System 1) and local complex multi-hop path reasoning (System 2) over the TCR-Digraph, respectively. The experimental results on four benchmark datasets demonstrate that CognTKE achieves significant improvement in accuracy compared to the state-of-the-art baselines and delivers excellent zero-shot reasoning ability. \textit{The code is available at https://github.com/WeiChen3690/CognTKE}.


EventFull: Complete and Consistent Event Relation Annotation

arXiv.org Artificial Intelligence

MEANTIME (Minard et al., 2016), and EventStoryLine Identifying the semantic relations between events (Caselli and Vossen, 2017) restrict event mentioned in a text, notably temporal, causal and pairs to a span of two consecutive sentences. This coreference relations, has been a fundamental goal limitation inherently prevents testing and training in NLP. Substantial efforts have been devoted to developing models on longer-range relations. Other datasets, various datasets that capture some or all of such as TimeBank (Pustejovsky et al., 2003b) and these relations (O'Gorman et al., 2016; Hong et al., MAVEN-ERE (Wang et al., 2022), did not publish 2016; Wang et al., 2022). These datasets were then a systematic annotation execution protocol that leveraged to develop and to evaluate corresponding guarantees actual complete annotation, and were models for detecting event-event relations (Hu subsequently criticized for being incomplete in et al., 2023; Guan et al., 2024). The output of such their relation annotation (Pustejovsky and Stubbs, models has been utilized in a range of downstream 2011; Rogers et al., 2024). Further, some researchers applications, with recent examples including event aimed to avoid the cost of manual annotation forecasting (Ma et al., 2023), misinformation detection altogether and employed fully-or partlyautomatic (Lei and Huang, 2023), and treatment timeline dataset creation methods (Mirza et al., extraction (Yao et al., 2024), among others.


Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification?

arXiv.org Artificial Intelligence

The automatic detection of temporal relations among events has been mainly investigated with encoder-only models such as RoBERTa. Large Language Models (LLM) have recently shown promising performance in temporal reasoning tasks such as temporal question answering. Nevertheless, recent studies have tested the LLMs' performance in detecting temporal relations of closed-source models only, limiting the interpretability of those results. In this work, we investigate LLMs' performance and decision process in the Temporal Relation Classification task. First, we assess the performance of seven open and closed-sourced LLMs experimenting with in-context learning and lightweight fine-tuning approaches. Results show that LLMs with in-context learning significantly underperform smaller encoder-only models based on RoBERTa. Then, we delve into the possible reasons for this gap by applying explainable methods. The outcome suggests a limitation of LLMs in this task due to their autoregressive nature, which causes them to focus only on the last part of the sequence. Additionally, we evaluate the word embeddings of these two models to better understand their pre-training differences. The code and the fine-tuned models can be found respectively on GitHub.